This paper presents a basic enhancement to the DeSTIN deep learning architecture
by replacing the explicitly calculated transition tables that are used to capture
temporal features with a simpler, more scalable mechanism. This mechanism uses
feedback of state information to cluster over a space composed of both the spatial
input and the current state. The resulting architecture achieves state-of-the-art
results on the MNIST classification benchmark.

1 Introduction

We introduce an enhancement to DeSTIN [1] that is aimed at simplifying the architecture and
improving its ability to capture temporal features. The DeSTIN architecture consists of multiple
instantiations of a common node arranged into layers. Using the recurrent clustering algorithm
presented here, spatiotemporal dependencies are naturally captured throughout the hierarchy. This
recurrent clustering algorithm replaces the earlier incremental clustering algorithm and
state-transition table, which previously formed the core of the DeSTIN node. Each node operates
independently of, and in parallel with, all others, which makes this architecture particularly
well suited to implementation in parallel hardware. The changes made here make a parallel
implementation, whether on a GPU or in custom analog circuitry, much more achievable.

2 Recurrent Clustering Algorithm 

In this section we describe a recurrent incremental clustering algorithm that forms centroids
on a space composed of a concatenation of the current spatial input and the previous belief state.
The belief state represents the probability that the current combination of external observation and
internal belief state belongs to each of the centroids.

2.1 Incremental Clustering 

The core of this recurrent clustering system is a winner-take-all clustering algorithm that maintains
estimates of the mean, $\mu$, and the variance, $\sigma^2$, for each dimension of each centroid [2]. These are
updated according to eqs. (1) and (2). In these equations $o$ is the current observation, $x$ is the
centroid being updated, and $0 < \alpha, \beta < 1$ are learning rates. To handle cases of poor
centroid initialization, a mechanism called the starvation trace is employed to help select the centroid to
be updated, as outlined in eqs. (3) and (4). The starvation trace, $\psi$, decays with time when a centroid
is not selected for an update. This value is used to weight the distance between the corresponding
centroid and the observation, so that centroids which are distant from all observations can still be
updated.

Figure 1: In recurrent clustering the belief state is concatenated with the current input to form the
space over which clustering is performed.



$$\mu_x = \alpha \mu_x + (1 - \alpha)(o - \mu_x) \qquad (1)$$

$$x = \arg\min_{c \in C} \left[ \psi_c \, \lVert o - \mu_c \rVert \right] \qquad (3)$$

$$\psi_c = \gamma \psi_c + (1 - \gamma) \, \mathbb{1}_{x = c} \qquad (4)$$
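
The selection and update steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `IncrementalClusterer`, the random initialization, and the default rates are hypothetical, and the mean and variance updates are written as standard exponential moving averages, which may differ in detail from eqs. (1) and (2). The winner selection and starvation-trace update follow eqs. (3) and (4).

```python
import math
import random


class IncrementalClusterer:
    """Winner-take-all clustering with a starvation trace (illustrative sketch)."""

    def __init__(self, n_centroids, dim, alpha=0.9, beta=0.9, gamma=0.99, seed=0):
        rng = random.Random(seed)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # Hypothetical initialization: random means, unit variances, full traces.
        self.mu = [[rng.random() for _ in range(dim)] for _ in range(n_centroids)]
        self.var = [[1.0] * dim for _ in range(n_centroids)]
        self.psi = [1.0] * n_centroids  # starvation trace, one per centroid

    def select(self, o):
        """Eq. (3): centroid with the smallest trace-weighted Euclidean distance."""
        def weighted(c):
            return self.psi[c] * math.dist(o, self.mu[c])
        return min(range(len(self.mu)), key=weighted)

    def update(self, o):
        """Update the winning centroid and all traces; return the winner's index."""
        x = self.select(o)
        a, b = self.alpha, self.beta
        for i, o_i in enumerate(o):
            err = o_i - self.mu[x][i]
            # Assumed moving-average forms for the mean and variance estimates.
            self.mu[x][i] = a * self.mu[x][i] + (1 - a) * o_i
            self.var[x][i] = b * self.var[x][i] + (1 - b) * err * err
        # Eq. (4): the winner's trace moves toward 1, all others decay toward 0.
        for c in range(len(self.psi)):
            hit = 1.0 if c == x else 0.0
            self.psi[c] = self.gamma * self.psi[c] + (1 - self.gamma) * hit
        return x
```

Because an unselected centroid's trace shrinks geometrically, its weighted distance eventually falls below that of a frequently selected centroid, so even a poorly initialized centroid is pulled toward the data after enough steps.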



2.2 Belief State Formulation 

Once the selected centroid is updated, the elements of the belief state, $b_c$, are updated using the
normalized Euclidean distance between the $d$-dimensional input vector, $o$, and each centroid, $c$, in
the set of centroids, $C$, as shown in eqs. (5) and (6).

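$$n_c = \sum_{i=1}^{d} \frac{(o_i - \mu_{c,i})^2}{\sigma_{c,i}^2} \qquad (5)$$

$$b_c = \frac{n_c^{-1}}{\sum_{c' \in C} n_{c'}^{-1}} \qquad (6)$$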
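
As a rough illustration of the belief computation, the sketch below assumes that each belief is proportional to the inverse of the variance-normalized distance to the corresponding centroid and that the beliefs are normalized to sum to one over the set of centroids. The function name `belief_state` and the small constant `eps`, which guards against division by zero, are hypothetical and not part of the original formulation.

```python
def belief_state(o, mus, variances, eps=1e-12):
    """Return one belief per centroid; beliefs are non-negative and sum to 1."""
    inverse = []
    for mu, var in zip(mus, variances):
        # Variance-normalized squared distance between the input and this centroid.
        n = sum((o_i - m_i) ** 2 / (v_i + eps) for o_i, m_i, v_i in zip(o, mu, var))
        # Closer centroids receive larger (assumed inverse-distance) weights.
        inverse.append(1.0 / (n + eps))
    total = sum(inverse)
    return [w / total for w in inverse]
```

For an input lying close to one centroid and far from the others, nearly all of the belief mass is assigned to the nearby centroid.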

2.3 Feedback Construct 

This belief state is used as a portion of the input to the core clustering algorithm, as depicted in
Figure 1. Introducing this feedback raises the concern of balancing the importance of the spatial
input and the temporal belief in the selection algorithm. If the temporal component is not given
adequate weight, the feedback loop will have no effect. If, however, it is given too much weight,
the spatial element may be ignored. In our experiments it was sufficient to employ a simple Euclidean
distance measure to select the winning centroid. However, if the distributions of the spatial input
and the belief states are vastly different, it may be necessary to use a normalized Euclidean distance
with a constant normalization vector, the elements of which serve as weights for each dimension.
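
The effect of the feedback on winner selection can be illustrated with a small sketch. The helper names `augmented_input`, `weighted_distance`, and `select_winner` are hypothetical, and applying the per-dimension weights as multipliers on the squared differences is one possible reading of the constant normalization vector, stated here as an assumption.

```python
import math


def augmented_input(spatial, belief):
    """Concatenate the spatial input with the previous belief state."""
    return list(spatial) + list(belief)


def weighted_distance(u, v, weights):
    """Euclidean distance with one constant (assumed multiplicative) weight per dimension."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def select_winner(spatial, belief, centroids, weights=None):
    """Index of the centroid closest to the concatenated (spatial, belief) vector."""
    z = augmented_input(spatial, belief)
    if weights is None:
        weights = [1.0] * len(z)  # plain Euclidean distance
    distances = [weighted_distance(z, c, weights) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)


# Two centroids that share the same spatial component and differ only in belief.
centroids = [[0.5, 1.0, 0.0], [0.5, 0.0, 1.0]]
first = select_winner([0.5], [0.9, 0.1], centroids)
second = select_winner([0.5], [0.1, 0.9], centroids)
```

In this example the same spatial input selects different centroids depending on the previous belief. Setting the belief weights to zero removes that distinction, which corresponds to the case in which the feedback loop has no effect.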

3 Simulation Results 

Here we report on a series of experiments designed to demonstrate the ability of the revised DeSTIN 
architecture to represent temporal information. First, we explore the recurrent clustering algorithm's 
abilities on a simple sequence analysis test case, and then the performance of the algorithm in a full- 
scale architecture is demonstrated on a couple of standard benchmarks. 

3.1 Recurrent Clustering for Sequence Detection 

It is important that the recurrent clustering be able to capture regularities across time. To
demonstrate this capability, binary sequence detection tasks were studied with varying lengths for
the sequence of interest. The clustering algorithm is presented, with equal probability, either the
sequence of interest or a sequence identical to it except that the first element is inverted. The belief
states for each of the sequences were provided to a feed-forward neural network for the purpose of
classifying each sequence. Figure 2 illustrates an exponentially decaying relationship between the
classification accuracy and the length of the sequence of interest. This is to be expected given that
there is no supervision to guide the algorithm in capturing temporal dependencies of any specific
length. The results highlight that if too few centroids are considered, the algorithm has insufficient
resources to represent dependencies that span longer time intervals. If too many centroids are used,
the belief state may capture features in the data not relevant to identifying the sequence of interest.
Desirably, however, this sensitivity to the number of centroids is weak.

Figure 2: Binary Sequence Detection: This plot shows average classification accuracy for varying
lengths of sequences.
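
The input stream for this task can be generated as in the following sketch. The function names `make_trial` and `make_stream`, the example target sequence, and the label convention (1 for the sequence of interest, 0 for the variant with an inverted first element) are assumptions made for illustration only.

```python
import random


def make_trial(target, rng):
    """Return (sequence, label); label is 1 for the sequence of interest, else 0."""
    if rng.random() < 0.5:
        return list(target), 1
    # Identical to the target except that the first binary element is inverted.
    distractor = [1 - target[0]] + list(target[1:])
    return distractor, 0


def make_stream(target, n_trials, seed=0):
    """Generate n_trials labelled sequences, each variant with probability 1/2."""
    rng = random.Random(seed)
    return [make_trial(target, rng) for _ in range(n_trials)]


stream = make_stream([1, 0, 1, 1], n_trials=200)
```

Since the two variants differ only in their first element, a classifier operating on the belief state at the end of the sequence can succeed only if the belief state has retained information from the start of the sequence.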

3.2 MNIST Dataset

The revised clustering scheme was next used in a full-scale architecture and applied to the MNIST
dataset classification problem [3]. The DeSTIN hierarchy used was made up of 3 layers of nodes,
with the layers consisting of 4x4, 2x2, and 1 nodes from bottom to top. The bottom layer viewed a
16x16 window that is shifted over the image, with each bottom-layer node receiving a unique 4x4
patch. The bottom, middle, and top layer nodes used 32, 24, and 32 centroids, respectively. The
hierarchy was then trained on 15,000 of the training set images. Next, all training and testing
images were provided to the hierarchy in order to generate feature vectors. The feature vector for
each image consisted of each node's belief state sampled at every 12th movement. These feature
vectors were provided to an ensemble of 11 feed-forward neural networks (FFNNs) trained with negative
correlation learning. Each FFNN consisted of two hidden layers, with 128 and 64 hidden neurons
in the first and second layers, respectively. A classification accuracy of 98.71% was achieved, which
is comparable to results using the first-generation DeSTIN architecture [1] and to results achieved
with other state-of-the-art methods [4][5][6].
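
The layout of the bottom layer and the size of the belief vector can be made concrete with a short sketch. The names `split_window`, `LAYERS`, and `beliefs_per_sample` are hypothetical, and the row-major ordering of patches is an assumption. The number of sampled movements per image is not stated above, so only the number of belief values collected at a single sampling instant is computed.

```python
def split_window(window, patch=4):
    """Split a square window (a list of rows) into non-overlapping patch x patch
    blocks, returned as flat lists in row-major order."""
    size = len(window)
    patches = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            block = [window[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            patches.append(block)
    return patches


# (number of nodes, centroids per node) for each layer, bottom to top.
LAYERS = [(16, 32), (4, 24), (1, 32)]


def beliefs_per_sample(layers=LAYERS):
    """Number of belief values produced by the whole hierarchy at one instant."""
    return sum(nodes * centroids for nodes, centroids in layers)


window = [[r * 16 + c for c in range(16)] for r in range(16)]
patches = split_window(window)   # 16 patches, one per bottom-layer node
per_sample = beliefs_per_sample()  # 16*32 + 4*24 + 1*32 = 640
```

The full feature vector for an image is then this per-instant length multiplied by the number of sampled movements.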